[WIP] LM Workload #860

rka97 · 2025-04-03T17:44:18Z

This is for the LM workload.

Updates to LM PR

- Added `limit_tf_threads` parameter to `pytorch_init` to control TensorFlow threading based on workload type. Dataloader was going OOM otherwise.
- Updated input pipeline to support "None" for weights (for memory).
- Modified Transformer model's `forward` method to optionally return loss during training. Should be better to fuse the loss later.
- Adjusted torch LM workload configuration for model dimensions and parameters to match jax.
- Updated transformers version in `pyproject.toml`, older version seems unavailable.
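
For reviewers, the "None" weights change can be pictured with a small framework-free sketch. This is an illustration under stated assumptions, not the code in this PR: the function name `weighted_cross_entropy` and its signature are hypothetical, and it only shows the idea that `weights=None` behaves like an all-ones mask without ever allocating one.

```python
import math
from typing import Optional, Sequence


def weighted_cross_entropy(
  logits: Sequence[Sequence[float]],
  targets: Sequence[int],
  weights: Optional[Sequence[float]] = None,
) -> float:
  """Weighted mean token loss. Hypothetical helper, not the PR's API.

  weights=None is treated as "every token has weight 1", so no
  all-ones array has to be built and carried through the pipeline.
  """
  total_loss = 0.0
  total_weight = 0.0
  for i, (row, target) in enumerate(zip(logits, targets)):
    # Numerically stable log-sum-exp over the vocabulary dimension.
    row_max = max(row)
    log_z = row_max + math.log(sum(math.exp(v - row_max) for v in row))
    token_loss = log_z - row[target]
    w = 1.0 if weights is None else weights[i]
    total_loss += w * token_loss
    total_weight += w
  # Guard against division by zero when every token is masked out.
  return total_loss / max(total_weight, 1e-12)
```

Calling it with `weights=None` and with an explicit list of ones should give the same value, which is the property the input pipeline change relies on.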

rka97 · 2025-10-21T05:10:46Z

Adding some TODOs:

- Add model matching tests (i.e. for the same inputs, outputs should be the same, kinda like model_match)
- Fix initialization to be the same for both PyTorch and JAX (there are some minor differences it seems)
- Add integration test for the lm workload to github actions
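
A rough sketch of what the model matching TODO could look like, assuming only that both implementations can be called on the same inputs and return comparable outputs. Everything here is hypothetical: `assert_models_match` is not an existing helper in the repo, and the two softmax functions are stand-ins for the JAX and PyTorch models.

```python
import math
from typing import Callable, Iterable, List, Sequence

Model = Callable[[Sequence[float]], List[float]]


def softmax_naive(x: Sequence[float]) -> List[float]:
  """Stand-in for one framework's model (hypothetical)."""
  exps = [math.exp(v) for v in x]
  total = sum(exps)
  return [e / total for e in exps]


def softmax_stable(x: Sequence[float]) -> List[float]:
  """Stand-in for the other framework's model (hypothetical)."""
  x_max = max(x)
  exps = [math.exp(v - x_max) for v in x]
  total = sum(exps)
  return [e / total for e in exps]


def assert_models_match(
  model_a: Model,
  model_b: Model,
  inputs: Iterable[Sequence[float]],
  atol: float = 1e-5,
) -> None:
  """Raises AssertionError if outputs differ by more than atol on any input."""
  for i, x in enumerate(inputs):
    out_a, out_b = model_a(x), model_b(x)
    if len(out_a) != len(out_b):
      raise AssertionError(f'input {i}: output lengths differ')
    diff = max((abs(a - b) for a, b in zip(out_a, out_b)), default=0.0)
    if diff > atol:
      raise AssertionError(
        f'input {i}: max abs diff {diff:.3e} exceeds atol={atol}'
      )


assert_models_match(softmax_naive, softmax_stable, [[1.0, 2.0, 3.0]])
```

In a real test the two callables would wrap the JAX and PyTorch models loaded with identical weights, and the tolerance would need tuning for the dtype in use.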

priyakasimbeg

Second round of small requested changes.

Something we should perhaps discuss: we need a more descriptive name for the workload, e.g. `fineweb_edu_lm`. What do you all think? @Niccolo-Ajroldi @rka97

priyakasimbeg · 2025-10-24T07:41:52Z

algoperf/workloads/lm/lm_jax/nanodo_model.py

@@ -0,0 +1,397 @@
+"""
+Originally based on code from the NanoDO repository under the Apache 2.0 license:


Can we rename this file to `models.py` to be consistent with the pattern in the other workload definitions?

priyakasimbeg · 2025-10-24T07:42:15Z

algoperf/workloads/lm/lm_pytorch/plainlm_model.py

@@ -0,0 +1,344 @@
+"""
+Originally based on the plainLM codebase:


Can we rename this file to `models.py` to be consistent with the other workload definitions?

priyakasimbeg · 2025-10-24T07:43:46Z

algoperf/workloads/workloads.py

    'workload_path': 'librispeech_deepspeech/librispeech',
    'workload_class_name': 'LibriSpeechDeepSpeechNormAndSpecAugWorkload',
  },
+  'lm': {'workload_path': 'lm/lm', 'workload_class_name': 'LmWorkload'},


Now that we have all the important implementation details figured out, should we pick a more descriptive name for the workload? I am thinking perhaps 'fineweb_edu_lm'.

priyakasimbeg · 2025-10-24T07:45:39Z

algorithms/archived_paper_baselines/adamw/pytorch/submission.py

  elif workload_name == 'mnist':
    return 16
+  elif workload_name == 'lm':
+    return 4


This should work for bsz 64, right?

priyakasimbeg · 2025-10-24T07:46:14Z

algorithms/archived_paper_baselines/nesterov/jax/submission.py

  elif workload_name == 'cifar':
    return 128
+  elif workload_name == 'lm':
+    return 8


This should work for bsz 64, right?

priyakasimbeg and others added 30 commits February 27, 2025 14:56

Merge pull request #847 from mlcommons/dev

1d81455

Dev -> main

first LM commit

da5f85a

lm data pipeline

a12a364

testing

ca83ab8

LM workload tested torch pipeline

e3e78dc

LM workload - fix torch tests

e619495

add LM tests, remove dev files

d8e9c56

add LM tests, remove dev files

6b4ff12

Stop tracking .gitignore

3c5c847

Remove dev/ from repo, keep locally

20d841b

fix comments

f3ba059

add class specifications

381451f

add workload LM info

f111d2e

restore data_utils.py tree map

808d398

fixed NFS bug

35f8f89

train/val split before concat

cbb6ee6

renamed datasets to avoid conflict with HF

868987c

Merge remote-tracking branch 'upstream/lm_workload' into lm_workload

8191f6d

renamed datasets to dataset

dd59ded

fix style

496b9c3

fix formatting

50989eb

fix style

5af0fdc

fix style

2683099

fix yapf

6b7ee29

fix style

46b645b

HF datasets pipeline

b3ae647

Testing with linear model

f095d4b

Merge branch 'jit_switch' into lm_workload

4189ae0

lm workload with linear model

0c22f3d

add nanodo model

99c7b9b

priyakasimbeg and others added 9 commits October 10, 2025 04:45

add todo for pytorch _eval_batch cleanup

617e1a3

Merge pull request #891 from mlcommons/lm_workload_priya

bebc80a

Updates to LM PR

add target setting algorithm for fineweb edu lm workload

64ea658

update step hint for lm workload

b38ade0

update target

65369f2

update eval split sizes for lm workload and target setting point

6171b2d

Merge branch 'lm_workload' of github.com:mlcommons/algorithmic-effici…

f111aea

…ency into lm_workload

Fix OOM bug in lm eval

1f0439a

rka97 force-pushed the lm_workload branch from c20c88f to 1f0439a on October 18, 2025 06:46

rka97 and others added 10 commits October 18, 2025 20:42

repeat dataset

b11c193

label smoothing default fix

42d1d1a

finish merge

c334c97

Make sure to take the correct number of batches in lm

d95f2bf

Merge branch 'lm_workload' of github.com:mlcommons/algorithmic-effici…

7deb070

…ency into lm_workload

Properly handle repetition in LM training and evaluation splits

0dc16db

move eval_batch from shared class to framework specific classes since…

7edb702

… pytorch calls detatch

finish merge

0879e68

Refactor imports and clean up unused code in LM workload and related …

73e3ea6

…modules

pass linter checks

91988af

priyakasimbeg requested changes Oct 21, 2025

Refactor loss function in LM workloads to unify label handling and im…

bb4a380

…prove clarity

rka97 force-pushed the lm_workload branch from f6a705d to bb4a380 on October 21, 2025 05:06

rka97 added 2 commits October 21, 2025 08:46

Fix init in both models to be the same, add lm model diff test

a58fbd5

Refactor model configuration classes to make them consistent between …

b59afa0

…JAX and PyTorch, also unify initialization to be the same in both

rka97 force-pushed the lm_workload branch from 2251c3e to b59afa0 on October 21, 2025 09:07

Add query-key normalization to CausalAttn and Attention classes, incl…

d35cdde

…uding learned scaling factor

rka97 force-pushed the lm_workload branch from bbde48b to d35cdde on October 23, 2025 17:11

priyakasimbeg requested changes Oct 24, 2025
